Fundamental Latency Trade-offs in Architecting DRAM Caches Outperforming Impractical SRAM-Tags with a Simple and Practical Design
نویسندگان
چکیده
This paper analyzes the design trade-offs in architecting large-scale DRAM caches. Prior research, including the recent work from Loh and Hill, have organized DRAM caches similar to conventional caches. In this paper, we contend that some of the basic design decisions typically made for conventional caches (such as serialization of tag and data access, large associativity, and update of replacement state) are detrimental to the performance of DRAM caches, as they exacerbate the already high hit latency. We show that higher performance can be obtained by optimizing the DRAM cache architecture first for latency, and then for hit rate. We propose a latency-optimized cache architecture, called Alloy Cache, that eliminates the delay due to tag serialization by streaming tag and data together in a single burst. We also propose a simple and highly effective Memory Access Predictor that incurs a storage overhead of 96 bytes per core and a latency of 1 cycle. It helps service cache misses faster without the need to wait for a cache miss detection in the common case. Our evaluations show that our latency-optimized cache design significantly outperforms both the recent proposal from Loh and Hill, as well as an impractical SRAM Tag-Store design that incurs an unacceptable overhead of several tens of megabytes. On average, the proposal from Loh and Hill provides 8.7% performance improvement, the “idealized” SRAM Tag design provides 24%, and our simple latency-optimized design provides 35%.
منابع مشابه
Addendum to “Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches”
Abstract The MICRO 2011 paper “Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches” proposed a novel die-stacked DRAM cache organization embedding the tags and data within the same physical DRAM row and then using compound access scheduling to manage the hit latency and a MissMap structure to make misses more efficient. This addendum provides a revised performan...
متن کاملExploring the Design Space of DRAM Caches
Die-stacked DRAM caches represent an emerging technology that offers a new level of cache between SRAM caches and main memory. As compared to SRAM, DRAM caches offer high capacity and bandwidth but incur high access latency costs. Therefore, DRAM caches face new design considerations that include the placement and granularity of tag storage in either DRAM or SRAM. The associativity of the cache...
متن کاملThe Effectiveness of SRAM Network Caches in Clustered DSMs
The frequency of accesses to remote data is a key factor affecting the performance of all Distributed Shared Memory (DSM) systems. Remote data caching is one of the most effective and general techniques to fight processor stalls due to remote capacity misses in the processor caches. The design space of remote data caches (RDC) has many dimensions and one essential performance trade-off: hit rat...
متن کاملAsynchronous DRAM Design and Synthesis
We present the design of a high performance on-chip pipelined asynchronous DRAM suitable for use in a microprocessor cache. Although traditional DRAM structures suffer from long access latency and even longer cycle times, our design achieves a simulated core sub-nanosecond latency and a respectable cycle time of 4.8ns in a standard 0.25um logic process. We also show how the cycle time penalty c...
متن کاملOptimizing Cache Utilization in Modern Cache Hierarchies
Memory wall is one of the major performance bottlenecks in modern computer systems. SRAM caches have been used to successfully bridge the performance gap between the processor and the memory. However, SRAM cache’s latency is inversely proportional to its size. Therefore, simply increasing the size of caches could result in negative impact on performance. To solve this problem, modern processors...
متن کامل